Exploration of Streptococcus core genome to reveal druggable targets and novel therapeutics against S. pneumoniae

Streptococcus pneumoniae (S. pneumoniae), the major etiological agent of community-acquired pneumonia (CAP) contributes significantly to the global burden of infectious diseases which is getting resistant day by day. Nearly 30% of the S. pneumoniae genomes encode hypothetical proteins (HPs), and better understandings of these HPs in virulence and pathogenicity plausibly decipher new treatments. Some of the HPs are present across many Streptococcus species, systematic assessment of these unexplored HPs will disclose prospective drug targets. In this study, through a stringent bioinformatics analysis of the core genome and proteome of S. pneumoniae PCS8235, we identified and analyzed 28 HPs that are common in many Streptococcus species and might have a potential role in the virulence or pathogenesis of the bacteria. Functional annotations of the proteins were conducted based on the physicochemical properties, subcellular localization, virulence prediction, protein-protein interactions, and identification of essential genes, to find potentially druggable proteins among 28 HPs. The majority of the HPs are involved in bacterial transcription and translation. Besides, some of them were homologs of enzymes, binding proteins, transporters, and regulators. Protein-protein interactions revealed HP PCS8235_RS05845 made the highest interactions with other HPs and also has TRP structural motif along with virulent and pathogenic properties indicating it has critical cellular functions and might go under unconventional protein secretions. The second highest interacting protein HP PCS8235_RS02595 interacts with the Regulator of chromosomal segregation (RocS) which participates in chromosome segregation and nucleoid protection in S. pneumoniae. In this interacting network, 54% of protein members have virulent properties and 40% contain pathogenic properties. Among them, most of these proteins circulate in the cytoplasmic area and have hydrophilic properties. Finally, molecular docking and dynamics simulation demonstrated that the antimalarial drug Artenimol can act as a drug repurposing candidate against HP PCS8235_RS 04650 of S. pneumoniae. Hence, the present study could aid in drugs against S. pneumoniae.

Introduction Nevertheless, mutations in the mosaic genes encoding Penicillin-Binding Protein (PBP) results in resistant isolates [25]. Macrolide resistance and resistance to fluoroquinolones originate from the overuse of these broad-spectrum antibiotics. The level of macrolide resistance varies from region to region. North America and the United Kingdom are prone to low macrolide resistance by drug efflux whereas Asian countries are liable to high macrolide resistance as a result of ribosomal methylation [26]. Generally, fluoroquinolone is the only antibiotic used to target the DNA gyrase of S. pneumoniae directly bringing a halt to its protein synthesis. However, repeated use of this antimicrobial agent resulted in a spontaneous number of mutations in the chromosomal genes that encode this enzyme [27]. As a result, the growing resistance of S. pneumoniae to commonly used antibiotics and non-vaccine serotypes underlines the urgent need for a new therapeutic target.
Examination of the whole genome of the organism is an important approach in combating the regulation of antibiotic resistance, non-vaccine serotypes, and developments of therapeutics. When varieties of genus Streptococcus strains are aligned together, several core genes can be observed. Around 33% of the S. pneumoniae genome constitutes uncharacterized proteins documented as Hypothetical proteins (HPs) [28]. These proteins are encoded by computationally predicted open reading frames but lack biochemical and chemical evidence. Although they lack functional characterization, they play a crucial role in biochemical and physiological pathways [22]. Bioinformatics tools and algorithms are very efficient to explore proteins [29,30]. Previous studies have shown some of the Streptococcus mutans' HPs are critical for antibiotic resistance and biofilm formation [22,31]. Additionally, this infectious agent produces several virulence factors involved in the survival of the pathogen and the progression of the disease [28]. Therefore, in the present study, we aimed to characterize the HPs encoded by the core genome of S. pneumoniae using a computational approach to uncover novel targets for drug development. Since the HPs are mutual in many Streptococcus species they might be interesting targets.

Retrieval of the genome sequences
The core genome and proteome of Streptococcus strains from different Streptococci species were extracted using the Efficient Database framework for comparative Genome Analyses using BLAST score Ratios (EDGAR 3.0) and visualized by BioCircos [32]. EDGAR 3.0 is software designed to perform genome comparisons using a high throughput approach [33]. S. pneumoniae PCS8235 (NCBI Reference Sequence: NZ_CM001835.1) was taken as the reference strain. HPs were mined manually from the proteome datasets of the strains.

Functional enrichment and determination of physicochemical properties of the hypothetical proteins
The functions of the HPs were unveiled using the Gene Ontology Functional Enrichment Annotation Tool (GO FEAT) webserver which works through sequence homology search. The proteins were classified based on the conservation of domains, motifs, families, and superfamilies and categorized via the InterPro, UniProt, European Molecular Biology Laboratory (EMBL), Kyoto Encyclopedia of Genes and Genomes (KEGG), and the National Center for Biotechnology Information (NCBI) databases respectively [34]. The physicochemical properties of the annotated HPs were further documented using the ProtParam tool of Expasy (https://web.expasy.org/protparam/). The parameters of the proteins included the molecular weight, theoretical pI point, amino acid composition, extinction coefficient, instability index, aliphatic index, and grand average of hydropathicity (GRAVY) of the protein [35].

Analysis of subcellular localization and unconventional protein secretion
CELLO (http://cello.life.nctu.edu.tw/) and PSORTb 3.0 (https://www.psort.org/psortb/) were utilized to predict the subcellular localization of the HPs [36,37]. Understanding subcellular localization is very important to characterize a protein as a target for a drug or vaccine [38]. The OutCyte 1.0 (http://www.outcyte.com/) and SecretomeP 2.0 (http://www.cbs.dtu.dk/ services/SecretomeP/) were also used to reveal the unconventional protein secretion as well as the HPs taking non-classical secretory pathways respectively [39]. OutCyte 1.0 is an online bioinformatics tool that mediates two steps to finally generate the proteins without N-terminal signals [40].

Identification of essential proteins
All the HPs of S. pneumoniae were queried against the Database of Essential Genes (DEG) database (http://origin.tubic.org/deg/public/index.php) in search of homologous genes within the inquired sequence [41]. Only the proteins encoded by the query sequences which shared similarities with the essential genes in the DEG database and an E-value of <0.0001 and bitscore >100 as cutoff were designated as essential proteins.

Virulence properties and pathogenicity
VirulentPred (http://bioinfo.icgeb.res.in/virulent/) server was used to evaluate the virulence activity of the HPs. This online tool is an SVM-based method that calculates a virulence potential score for a given protein [42]. Simultaneously, the tool can distinguish virulent and nonvirulent proteins. Moreover, the pathogenic proteins were identified from the functionally annotated proteome using the MP3 tool [43]. This software exploits a combined SVM-HMM approach while identifying proteins from both genomic and metagenomic databases with high accuracy, efficiency, and sensitivity.

Protein-protein interaction network analysis
The protein-protein interaction (PPI) (https://string-db.org/) network between the functionally annotated HPs was visualized using the STRING database [44]. In this study, the available S. pneumoniae D39 in the STRING database was selected as the reference genome and an interconnected PPI network was constructed. The basis for the network lay in high-throughput lab experiments, gene expression data, and computational data. The network was built with the default confidence parameters.

Excavation of druggable proteins
The druggability of the annotated proteins was assessed through Drugbank BlastP [45]. Drugbank (https://go.drugbank.com/) is a comprehensive bioinformatics and cheminformatics resource containing relevant information about drugs and their corresponding targets.

Assessment of Structural proteins
Models for the selected proteins were obtained from Robetta (https://robetta.bakerlab.org/). Robetta analyzes the putative domains of the submitted sequences and generates 3-dimensional structural models [46]. The models were further evaluated using SWISS Structure assessment (https://swissmodel.expasy.org/). SWISS Structure assessment includes Ramachandran Plots and MolProbity scores [47]. Ramachandran Plots visualize energetically favored regions for backbone dihedral angles against the amino acids present in the protein structure while MolProbity examines the quality of protein models for both nucleic acids and proteins at global and local levels [48].

Molecular docking simulation
The Canonical Simplified molecular-input line-entry system (SMILES) of the interacted drugs were collected from DrugBank BLASTp. These canonical smiles were converted into PDB files via the CACTUS online smile translator (https://cactus.nci.nih.gov/translate/). The druggable proteins were then docked using AutoDock within PyRx software, which is a combination of several tools necessary for Molecular Docking [49]. The protein-ligand interactions were then visualized using Discovery Studio Visualizer (https://discover.3ds.com/).

Molecular Dynamics (MD) simulation
To evaluate the stability of the complexes under physiological conditions, 100 ns Molecular Dynamics (MD) simulation was carried out using GROningen MAchine for Chemical Simulations (GROMACS version 5.1.1). The GROMOS96 43a1 force-field was applied to the proteinligand complexes. The physiological condition of the system was defined as (300 K, pH 7.4, 0.9% NaCl). The structures were solvated in a dodecahedral box of the SPC (simple point charge) water model with its edges at a 1nm distance from the protein surface. The overall charge of the system was neutralized through the addition of 2 sodium ions using the genion module. Energy minimization of the neutralized system was carried out using the steepest descent minimization algorithm with a maximum number of minimization steps to perform was set at 50000. The ligand was restrained before carrying out the isothermal-isochoric (NVT) equilibration of the system for 100 ps with a short-range electrostatic cutoff value of 1.2 nm. Isobaric (NPT) equilibration of the system was carried out for 100 ps following the NVT with a short-range van der Waals cutoff fixed at 1.2 nm. Later, a 10 ns molecular dynamic simulation was run using periodic boundary conditions and a time integration step of 2 fs. The energy of the system was saved every 100 ps. For calculating the long-range electrostatic potential, the Particle Mesh Ewald (PME) method was applied. The short-range van der Waals cutoff was kept at 1.2 A modified Berendsen thermostat was used to control simulation temperature while the pressure was kept constant using the Parrinello-Rahman algorithm. The simulation time step was selected as 2.0 fs. The snapshot interval was set to 100 ps for analyzing the trajectory data. Finally, all of the trajectories were concatenated to calculate and plot root mean square deviation (RMSD), root mean square fluctuation (RMSF), the radius of gyration (Rg), and solvent accessible surface area (SASA) data. Root Mean Square Deviation (RMSD) calculation was performed to evaluate when a system attains equilibrium. The "rms" module built into the GROMACS software was utilized to extract RMSD information throughout the simulation. The results were plotted graphically using the ggplot2 package of R (https://ggplot2. tidyverse.org/). Room Mean Square Fluctuation (RMSF) is used to determine the flexibility of a certain region of the protein. The radius of gyration of our proteins was measured to determine their degree of compactness. A relatively steady value of the radius of gyration means stable folding of a protein. Fluctuation of the radius of gyration implies the unfolding of the protein. The "gyrate" module was used to generate the radius of gyration graphs for our proteins.
Hydrophobic interactions composed of non-polar amino acids are crucial for maintaining the stability of the hydrophobic core of proteins. They do so by covering the non-polar amino acids within the hydrophobic cores and keeping them at a distance from the solvent. Solvent Accessible Surface Area (SASA) is used in MD simulations to predict the hydrophobic core stability of proteins.

Results
In the study, the core genome of S. pneumoniae that encodes hypothetical proteins (HPs) was computationally annotated and analyzed to identify potential drug targets. The workflow has been illustrated in Fig 1.

Identification of proteins from core and pan genome
Out of 22 different Streptococci species from EDGAR 3.0, 9084 genes from pan genome were identified and several were visualized by the BioCircos tool. Subsequently, 498 genes were extracted from the core genome (S1 Data). The core genome dataset of S. pneumoniae encoded 28 HPs (Fig 2) (S2 Data).

Functional analysis of the hypothetical proteins
The gene ontology (GO) analysis of the 28 HPs revealed that 21 annotated proteins had been involved in biological processes (BPs), molecular functions (MFs) and cellular components (CCs) and the remaining belonged to uncharacterized domains/ protein families (S3 Data). The HPs corresponded to 20 GO terms in total while some of the HPs possessed more than one GO term (Fig 3 and Table 1). HPs were homologs of enzymes, binding proteins, transporters, and regulators (Fig 3). A high proportion of HPs exhibited MFs such as aminoacyl-tRNA ligase activity, RNA binding, DNA binding, hydrolase activity, transferase activity, tRNA binding, methylation, nucleic acid binding, ATP binding, metal ions binding, Flavin adenine dinucleotide binding, tRNA modification, and metallopeptidase activity, indicating they may take

Physicochemical characterization
The physicochemical properties of the 28 HPs are provided in Table 2. The length of the analyzed sequence ranged from 71 to 560 AA with a molecular weight (MW) ranging from 8462.34 and 64392.6 Da. The isoelectric point (pI) of the proteins ranged from 3.92 to 10.07 with an average value of 6.20. The isoelectric point determines the protein load. At this pH, the protein does not have any charge and therefore does not move in the electric field when a direct current is passed through it [50]. MW and pI parameters are important for purification and crystallization during in vivo experimentation. The extinction coefficient measures the amount of light that a protein absorbs at a certain wavelength. In silico identification of the extinction coefficient is necessary to evaluate the study of protein-protein interaction in solution [51]. The extinction coefficient of the provided hypothetical proteins ranged from 1615 to 57190 at 280 nm. Here, HPs possessed high extinction coefficient values that indicate the proteins constitute a high concentration of cysteine, tyrosine, and tryptophan. The instability index determines the stability of the protein in a test tube. An instability index value of less than 40 predicts the protein to be stable and over 40 adopts the protein to be unstable [52]. With variable ranges of instability index, more than 50% of the proteins were noticed to be stable with a mean value of 39.21 in this study. The aliphatic index estimates the thermostability of the protein depending on the area occupied by the aliphatic chain. Proteins with high aliphatic index show stability in high temperatures, whereas proteins with low aliphatic value are less flexible and thermally unstable [53]. Here, the aliphatic index of the proteins ranged from 32.66 to 1135.05 with a mean value of 92. 16. Most of the proteins were observed with an aliphatic index value above 40 which means most of them are thermally stable. The GRAVY of the protein indicates the interaction of proteins with water. It is calculated by adding all the hydropathy values of the amino acids followed by dividing the number of residues in the sequence. The outcome for GRAVY predicts the presence of the protein-water interactions between all the proteins except two. Therefore, the majority of them are water-soluble.

Prediction of essential proteins
Identification of essential genes contributes a principal role in the development of new drugs or vaccines that inhibit the dissemination of resistant strains [54]. Through a similarity search against the essential proteins of bacteria present in the DEG database, 16 HPs were considered as essential proteins with an E-value<0.0001 and bit score>100 as shown in Table 3.

Virulence factors and pathogenicity
Virulence factors are the basis of bacterial infections. The virulent and pathogenic proteins help microbes invade the host and manipulate the host immune system for its survival [55].
The comprehensive predictive data for virulence is provided. The VirulentPred server identified 54% of the HPs as having virulent properties while 46% owned non virulent properties. Moreover, the MP3 webserver recognized 8 proteins out of 28 as pathogenic. Among these 8 proteins, 4 portrayed both virulence and pathogenic properties (Fig 4) (S4 Data).

Examination of the subcellular localization and unconventional protein secretion
During drug discovery, knowledge of subcellular localization can be leveraged to significantly improve the identification of druggable targets [56]. Proteins are categorized as drugs or Table 3. Essential proteins as identified by the DEG database. More than half of the proteins were found to be essential. vaccine targets based on subcellular location. Considering the results of the CELLO and PSORTb 3.0 tools, the majority of the proteins were characterized as cytoplasmic and 6 as membrane-bound with a sole exception being located in the extracellular matrix (Table 4). This implies that most of the proteins prefer the cytoplasmic region for their activity. Furthermore, OutCyte 1.0 data predicted that 64% of all the HPs undergo potential unconventional protein secretion (UPS) and the rest possessed signal peptide, the transmembrane domain, and intracellular secretion. Additionally, SecretomeP 2.0 identified 20 non classical secreted proteins. These proteins lacked a signal peptide and had a threshold SecP value above 0.5. All the predicted results for CELLO, PSORTb 3.0, OutCyte 1.0, and SecretomeP 2.0 are presented in Table 4.

Druggability of the proteins
The druggability prediction of HPs classified three proteins with homologous genes (PCS8235_RS02820, PCS8235_RS04650, and PCS8235_RS10805) available in the Drugbank database (Table 5). DrugBank contains arrays of gene families that have been successfully targeted by drugs with required affinity. One of the HPs (PCS8235_RS04650) had two homologous compounds. Three of the predictive druggable proteins were filtered into two (PCS8235_RS02820, PCS8235_RS04650) based on a cut-off E-value of <0.0001 and bit-score >100.

Protein model building
Robetta predicted 5 models for each druggable protein. Robetta modeling program was implemented to predict the best model among all. The Ramachandran plot percentage and the Molprobity score of the proteins are tabulated below (S6 Data). MolProbity score should be as low as possible and Ramachandran plot >98% is favored for an ideal protein structure. Model 2 for both of the proteins was selected using these criteria. Model 2 of both of the proteins (PCS8235_RS02820 and PCS8235_RS04650) were selected based on Ramachandran plot percentages of 98.77% and 99.24% with a low morbidity score of 2.64 and 2.69 respectively.

Elucidation of molecular docking
Blind docking was performed to evaluate the binding energy between each ligand and receptors. Two of the expected potential HPs, PCS8235_RS02820 and PCS8235_RS04650 were docked against the interacted drugs from DrugBank (Fig 6). The binding affinity between druggable targets and commercially available drugs was satisfactory. Protein PCS8235_RS02820 docked against 2-(N-morpholino) ethanesulfonic acid with docking energy of -4.4 kcal/mol. Protein PCS8235_RS 04650 bonded with Artenimol and Aspartate beryllium trifluoride with the binding affinity of -9.5 kcal/mol and -4.7 kcal/mol respectively.

Examination of molecular dynamics simulation
The RMSD graph of PCS8235_RS02820 and PCS8235_RS02820-drug complex demonstrated that the backbone of the proteins started to deviate around 12 ns and overlapped around 75 ns https://doi.org/10.1371/journal.pone.0272945.g005 (Fig 7). The RMSF and SASA graphs of this protein showed that the binding of the drug altered mobility and solvent accessibility in the protein (Figs 8 and 9). Rg analysis unveiled that the compactness of the protein and drug-protein complex was most of the time similar during the simulations (Fig 10). For PCS8235_RS04650 and PCS8235_RS04650-drug bound complexes also gave similar results. The RMSD backbones did not overlap most of the time through 100ns. The RMSF values or mobility with protein compactness (Rg) of the structures altered due to drugs. Slight solvent accessibility was observed in SASA analysis.

Discussion
S. pneumoniae has been the prime cause of death via community-acquired pneumonia (CAP) worldwide [1]. Combating the mortality and morbidity rate of CAP is becoming more challenging due to the increasing prevalence of antimicrobial resistance [57]. Following the drastic rise in antimicrobial resistance, current research on producing new antibiotics seems timeconsuming, and costly. For several years, genomics and evolutionary studies on this organism were endeavored to determine new druggable targets [58,59]. Yet, few studies have been published on the unexplored realm of proteins known as HPs. About nearly 30% of S. pneumoniae encodes HPs. Some homologs of HPs are conserved within various streptococci species [60].
In the present study, while analyzing the core genome of the Streptococcus species, we have found 498 core genes among 22 Streptococci species where 28 belonged to HPs (around 5.6%).
Since they are widely distributed in the Streptococcus genus through evolution, they might be critical for the survival of the bacteria. Proper characterization of these HPs could give plausible targets against S. pneumoniae pathogen. Previous studies demonstrated that HPs play a vital role in biofilm formation, pathogenesis, and virulence of tooth decaying S. mutans [61] whereas, in this study, we have also identified the protein PCS8235_RS05215 that plays a key role in biofilm formation. Understanding the functional annotation of these HPs is essential to comprehending the molecular and biological mechanisms of all species. Bioinformatics approaches aiding functional annotation and computational drug designing can be a better way to identify the potential drug targets from these HPs. When the core HPs went under gene ontology-based characterization, 75% of the proteins revealed their characteristic. Most of them were enzymes and binding proteins. Enzymes are involved in several biochemical reactions and self-defense mechanisms playing a key role in bacterial survival inside the host. Here, protein PCS8235_RS3915 and PCS8235_RS4650 were involved in hydrolase activity. Protein PCS8235_RS3915 is a DNA repair RadC homolog while PCS8235_RS4650 is a haloacid dehalogenase homolog. Haloacid dehalogenase homologs are a superfamily of enzymes involved in cellular functions starting from amino acid biosynthesis to detoxification [62,63]. In prokaryotes, DNA repair RadC homolog helps in DNA damage repair after UV and X-ray radiation [64]. In the same way, protein PCS8235_RS2800 is related to branched-chain amino acid aminotransferase activity and protein PCS8235_RS6025 is associated with methyltransferase activity. Hence, enzymes can be assumed as a druggable target for different therapeutics. The preceding proteins (PCS8235_RS01335, PCS8235_RS01340, PCS8235_RS02405, PCS8235_RS03345, PCS8235_RS06025 and PCS8235_RS06945) also take part in nucleic acid-binding. Other proteins (PCS8235_RS02405, PCS8235_RS03345, PCS8235_RS06025, and PCS8235_RS06945) were somehow involved in the central dogma of the bacterial pathogen. Protein PCS8235_RS05215 denoted transcriptional regulator YlbF, YmcA that formed a complex with YaaT regulated competence and biofilm formation in Bacillus subtilis [65]. Furthermore, targeted proteins must be essential to the concerned pathogen since essential genes strongly support the cellular life of living microorganisms [66]. For the study, more than half of the proteins were identified as essential proteins of S. pneumoniae. Later, the virulence and pathogenicity of the essential HPs were noted. Out of the 19 proteins analyzed for virulence and pathogenic characteristics, 21% were found to be both virulent and pathogenic.
Moreover, the integrated roles of the bacterial proteins were elucidated via protein-protein interactions. Protein-protein interaction is required to understand the specific functions of bacterial protein which might be influenced by the neighboring proteins. This helps in finding its significance in bacterial physiology. In the provided network, the protein PCS8235_RS05845 formed the highest nodes (8 interactions) with other HPs that might have critical cellular functions and the potential to go under unconventional protein secretions. Most of the HPs play a physiological role in the cytoplasmic region and have hydrophilic properties. Therefore, they can be considered as drug targets rather than vaccines [29,67]. This cytoplasmic characteristic was evaluated by subcellular localization and physicochemical properties. PCS8235_RS05845 also has TRP structural motif along with virulent and pathogenic properties. Previous studies reported that some TRP-containing proteins are involved directly in virulence-associated functions, especially, in maintaining proper functions of the type III secretion system and class II chaperones [67]. The next highest interactive protein is PCS8235_RS02595 (7 interactions). Although this protein is uncharacterized still now, its interaction with other proteins can suggest its significant role in the network. PCS8235_RS02595 is interconnected to Pneumococcal vaccine antigen A (PCS8235_RS03660), cell division protein Zap A (PCS8235_RS00650), putative Anti-CRISPR (PCS8235_RS04815) and Regulator of chromosomal segregation (RocS) (PCS8235_RS03470) proteins. Zap A is a Z-ring-associated protein found in Bacillus subtilis [68]. This protein binds to tubulin-like protein FtsZ at cytokinesis and therefore takes part during bacterial cell division [69]. Putative Anti-CRISPR may inhibit the bacterial CRISPR-Cas system. RocS participates in chromosome segregation and nucleoid protection in S. pneumoniae [70].
The druggability of the essential HPs was checked to ensure the proteins' susceptibility toward the active site of the small inhibitor molecules. In other studies, the drug targets in S. pneumoniae were explored; however, it has only screened the potential drug targets for the pathogen [71]. Here, only the compounds with both druggability and essentiality are chosen for further analysis. The 3D structure of the targeted proteins was then analyzed and the best model was selected for molecular docking. The basis of docking was to identify the relationship between the hypothetical proteins and available druggable ligands. A blind docking test for each of the whole proteins with expected drug candidates reveals satisfactory drug affinity towards the ligands. The strongest binding affinity could be noticed between PCS8235_RS04650 and Artenimol with the binding affinity of -9.5 kcal/mol. MD simulations also support that these druggable compounds will change the targeted bacterial proteins by impeding their general physiology. Therefore, this predictive essential protein could be an eligible drug target for the development of new pneumococcal therapeutics.

Conclusion
This study focused on the HPs that are found in the core genome of S. pneumoniae and have a major role in the survival, virulence, and pathogenesis of the pathogen. The majority of the observed proteins are mainly involved in bacterial transcription and translation while a high proportion of the proteins were recognized as essential proteins. Hence they make up good targets for broad spectrum anti-streptococcal drug development.